Contents

1 Introduction

Head and neck squamous cell carcinoma (HNSCC) is a heterogenous group of tumors at many anatomical sites in the oral cavity, pharynx and larynx originating from mucosal epithelium (Ferlay et al. 2019). HNSCC is the 6th most common cancer worldwide, with a median diagnosis of 66 years, and represents ~90% of head and neck cancers (“Head and Neck Squamous Cell Carcinoma | Nature Reviews Disease Primers,” n.d.). HNSCC accounts for more than 600,000 new cases and 350,000 deaths annually. Tobacco smoking, alcohol use, and areca nuts are major risk factors in developing HNSCC. Cancer arising in the oral cavity or larynx are typically associated with tobacco and alcohol use, whereas Oropharyngeal tumors- affecting the tonsil, back 1/3 of the tongue, and throat- are almost always HPV positive. HPV positive HNSCC is associated with high-risk HPV subtypes, namely HPV16, and is associated with better outcomes when compared to HPV negative HNSCC (Michaud et al. 2014).

The different anatomical site and etiology of HPV positive and negative HNSCC suggest that these two diseases have different molecular characteristics. Understanding the molecular differences between HPV positive and negative HNSCC is crucial for developing targeted therapies and improving patient outcomes.This analysis aims to identify differentially expressed genes (DEGs) between HPV positive and HPV negative HNSCC samples using RNA-seq data from NCBI GEO.The data set from Wood, O, et al. (GSE72536) and the manuscript “Gene expression analysis of TIL rich HPV driven head and neck tumors reveals a distinct B-cell signature when compared to HPV independent tumours” was used (Wood et al. 2016). Here, we performed quality assessment, preprocessing, batch effect correction, differential expression analysis, and functional enrichment analysis to elucidate the molecular differences associated with HPV status in HNSCC.

2 Methods

Quality assessment

Raw counts were inspected for library size, count distribution, sample correlation, and principal components. Library sizes were computed as column sums of the raw count matrix and visualized with a barplot. Per‑sample count distributions were visualized as log2‑CPM density plots and boxplots. Sample–sample Pearson correlations were computed on log2‑CPM and displayed as a heatmap. Principal component analysis (PCA) was performed on the top variable genes (top 20% by variance) and plotted to inspect sample clustering and potential confounders.

Preprocessing

Genes with low expression were filtered using edgeR::filterByExpr() (Robinson, McCarthy, and Smyth 2010) to retain genes with sufficient counts for statistical testing. TMM normalization was applied with edgeR::calcNormFactors(). For visualization, log2‑CPM with a prior count of 1 was used. Gene identifiers were mapped to Entrez IDs using AnnotationDbi::mapIds(org.Hs.eg.db)(Carlson 2019)

Batch effect correction

Batch variables were unknown so surrogate variables were estimated with sva::svaseq()(Leek and Storey 2007) and included surrogate variables in the differential expression model; surrogate variables were also regressed out for visualization. PCA plots and hierarchical clustering (top 10% most variable genes) are shown before and after correction.

Differential expression analysis

Differential expression was performed with edgeR (Robinson, McCarthy, and Smyth 2010) using a generalized linear model. The design matrix included the primary factor (HPV status) and surrogate variables estimated by SVA (Leek and Storey 2007). Dispersion was estimated with estimateDisp(), models were fit with glmFit(), and tests were performed with glmLRT(). P‑values were adjusted for multiple testing using the Benjamini–Hochberg procedure. Genes with FDR < 0.05 and |log2FC| ≥ 1 were considered significant. Top differentially expressed genes and boxplots for the top 10 genes are shown.

Functional enrichment analysis

Gene Ontology enrichment (BP, MF, CC) and KEGG pathway enrichment were performed using clusterProfiler (Yu et al. 2012). Over‑representation analysis (ORA) used enrichGO/enrichKEGG on the set of significant genes (FDR < 0.05, |log2FC| ≥ 1) with the tested gene universe defined as all mapped Entrez IDs. Gene set enrichment analysis (GSEA) used a ranked gene list by logFC and FDR. Multiple testing correction used Benjamini–Hochberg; terms with adjusted p < 0.05 were reported. KEGG queries used the KEGG REST API. Visualizations include dotplots and pathview (Luo and Brouwer 2013) for KEGG pathway diagram.

3 Results

3.1 Figure 1. Quality assessment of RNA-seq data

First, we performed quality assessment of the RNA-seq data to ensure data integrity for downstream analysis. Figure 1a provides an overview of the BIOS658 project workflow.
Figure 1a. Overview of BIOS658 project.

Figure 1: Figure 1a
Overview of BIOS658 project.

Next, the library size distribution across samples was plotted to assess sequencing depth. Total counts ranged from ~7 million to ~18 million reads per sample,indicating variability in sequencing, but consistent coverage and no extreme outliers (fig 1b).
Figure 1b. Library size distribution across samples.

Figure 2: Figure 1b
Library size distribution across samples.

We further wanted to evaluate the expression distribution between samples. log2-CPM density and box plot showed similar, and perhaps better stabilized distributions across samples, indicating successful normalization (fig 1c, 1d).
Figure 1c. Log2-CPM density plot across samples.

Figure 3: Figure 1c
Log2-CPM density plot across samples.

Figure 1d. Log2-CPM boxplot across samples.

Figure 4: Figure 1d
Log2-CPM boxplot across samples.

Lastly, sample–sample Pearson correlation heatmap revealed high correlation coefficients, confirming data integrity. Furthermore, no obvious outliers or unexpected batch clustering.
Figure 1e. Sample–sample correlation heatmap.

Figure 5: Figure 1e
Sample–sample correlation heatmap.

3.2 Figure 2. Batch effect correction and hierarchical clustering

Principal component analysis (PCA) and hierarchical clustering were performed to assess batch effects and sample clustering based on HPV status. First, PCA plots before and after batch effect correction are shown in Figure 2a and 2b, respectively. Before correction, samples clustered primarily by HPV status, indicating that biological variation was the dominant source of variation. After SVA correction, the clustering pattern remained similar, suggesting that batch effects were minimal and did not confound the biological signal.

Figure 2a. PCA plot before batch correction.

Figure 6: Figure 2a
PCA plot before batch correction.

Figure 2b. PCA plot after batch correction.

Figure 7: Figure 2b
PCA plot after batch correction.

Likewise, We next performed hierarchical clustering using the top 10% most variable genes to further assess sample relationships. Before batch effect correction, samples clustered mainly by HPV status (Figure 2c). After SVA correction, the clustering pattern remained consistent (Figure 2d), reinforcing that HPV status was the primary driver of sample variation rather than batch effects.
Figure 2c. Hierarchical clustering before correction.

Figure 8: Figure 2c
Hierarchical clustering before correction.

Figure 2d. Hierarchical clustering after correction.

Figure 9: Figure 2d
Hierarchical clustering after correction.

3.3 Figure 3. Differential expression analysis results

After successful Quality assurance and batch correction, we performed differential expression analysis using edgeR to identify genes differentially expressed between HPV positive and negative samples. The MA plot (Figure 3a) illustrates the overall distribution of log2 fold changes versus average expression, highlighting significantly upregulated and downregulated genes, spanning a wide range of expression levels. This provides evidence that both highly and moderately expressed genes contribute to HPV +/- related expression profiles.
Figure 3a. MA plot of differential expression.

Figure 10: Figure 3a
MA plot of differential expression.

A heatmap of the top 50 differentially expressed genes (Figure 3b) shows distinct expression patterns between HPV positive and negative samples, indicating clear separation based on HPV status.
Figure 3b. Heatmap of top 50 DE genes.

Figure 11: Figure 3b
Heatmap of top 50 DE genes.

The volcano plot (Figure 3c) visualizes the relationship between log2 fold changes and statistical significance, with several genes exhibiting both high fold changes and low p-values. Genes like SYCP2, STAG3, TEX15, ZNF541, and ZPBP2 are significantly upregulated in HPV + tumors. Genes like ADIPOQ,SHISA6, PPP1R14C, KLC1, and CFAP262 are significantly downregulated in HPV + tumors. This indicates that HPV + tumors have a distinct transcriptional profile compared to HPV - tumors, with many genes related to immune modulation.
Figure 3c. Volcano plot of DE results.

Figure 12: Figure 3c
Volcano plot of DE results.

Next, boxplots of the top 10 differentially expressed genes (Figure 3d) demonstrate consistent expression differences between the two groups. Most genes have higher expression in HPV + tumors, but with SHISA6 notably lower in HPV + tumors. This plot further validating the differential expression results.
Figure 3d. Boxplots of top 10 DE genes.

Figure 13: Figure 3d
Boxplots of top 10 DE genes.

3.4 Figure 4. Functional enrichment analysis results

To further interpret the biological significance of the differentially expressed genes, we performed functional enrichment analysis using over-representation analysis (ORA) for Gene Ontology (GO) terms and KEGG pathways. First, GO Biological Process enrichment results (Figure 4a) revealed significant enrichment in immune-related processes, such as Leukocyte cell-cell adhesion, regulation of lymphocyte activation, lymphocyte proliferation, and Immune receptor signaling. This suggests that HPV positive tumors may have a more active immune microenvironment compared to HPV negative tumors.
Figure 4a. GO Biological Process enrichment .

Figure 14: Figure 4a
GO Biological Process enrichment .

We also performed GO Molecular Function enrichment analysis (Figure 4b) and GO Cellular Component Enrichment (4c). Molecular function enrichment highlighted terms related to immune functions, including GTPase regulator activity, cytokine receptor activity, MHC protein complex binding, and SH3 domain binding. This indicates an active role of immune signaling and regulation in HPV positive tumors. The Cellular Component enrichment analysis (Figure 4c) further supported these findings, with significant terms including external side of plasma membrane, immunological synapse, and collagen-containing extracellular matrix.These results not only point towards a direct immune involvement, but also suggest extracellular remodeling in HPV positive tumors.
Figure 4b. GO Molecular Function enrichment (ORA).

Figure 15: Figure 4b
GO Molecular Function enrichment (ORA).

Figure 4c. GO Cellular Component enrichment (ORA).

Figure 16: Figure 4c
GO Cellular Component enrichment (ORA).

Lastly, KEGG pathway enrichment analysis (Figure 4d) was performed and identified multiple top pathways related to immune responses, including Cytokine-cytokine interaction, Cell adhesion molecules (CAMs), NK cell mediated cytotoxicity, Th17/Th1/Th2 cell differentiation and antigen presentation. These findings further emphasize the prominent role of immune processes in distinguishing HPV positive from HPV negative HNSCC tumors.
Figure 4d. KEGG pathway enrichment (ORA).

Figure 17: Figure 4d
KEGG pathway enrichment (ORA).

3.5 Figure 5. KEGG pathway map

To visualize the biological involvement of the differentially expressed genes, we generated a KEGG pathway map.KEGG ORA identified antigen processing and presentation as differentially expressed in HPV+ HNSCC. Viruses are well known to alter MHC and antigen presentation as an immune evasion mechanism.Figure 5 shows the KEGG pathway map for Antigen processing and presentation (hsa04612).The KEGG pathway reveals upregulation of MHC l presentation to CD8+ T cells via TAP1/2, as well as MHC II presentation to CD4+ T cells via HLA-DM and HLA-DO. There are many explanations for these findings. The simplest is that with the presence of viral antigens, there is increased antigen presentation to T cells. Another possibility is that antigen presentation is enhanced in HPV+ cancer.
Figure 5. KEGG pathway map (Antigen processing and presentation, hsa04612).

Figure 18: Figure 5
KEGG pathway map (Antigen processing and presentation, hsa04612).

4 Discussion

In this study, we analyzed RNA-seq data from HPV positive and negative head and neck squamous cell carcinoma (HNSCC) samples to identify differentially expressed genes (DEGs) associated with HPV status. Quality assessment confirmed data integrity. Bar graphs and box plots of library size and log2-CPM were performed and revealed relatively uniform counts. There was a range of ~ 10 million counts between the high and low samples, but no outliers were detected. This is expected as the NCBI GEO data set GSE72536 was already filtered by the authors prior to uploading. The authors exluded 16 samples in the original study (39 HNSCC samples down to 23 samples), selecting for TIL-rich tumors.This could explain any differences in results, due to a second layer of quality control. Nonetheless, samples were originally selected to be immune rich, and as expected, we see an increase in immune-related GO terms after GO analysis.Next, we assessed batch effects using PCA and hierarchical clustering. In this analysis, PCA plots before and after batch effect correction were performed as part of a standard pipeline. Before SVA correction, the PCA plot (Figure 2a) showed strong clustering around HPV status that remained after SVA correction, indicating that HPV status was a major source of variation in the data. Hierarchical clustering (Figure 2c and 2d) further supported this, with samples clustering primarily by HPV status rather than batch. This suggests that while batch effects were present, they were minor and did not defer from the biological question.

Differential expression analysis using edgeR identified a substantial number of DEGs between HPV positive and negative samples. The MA plot (Figure 3a) illustrated a clear separation of upregulated and downregulated genes across a range of expression levels. This suggest that differences between HPV + and HPV - samples are not due to a handful of highly expressed genes. The heatmap of the top 50 DEGs (Figure 3b) demonstrated distinct expression patterns between the two groups,and The volcano plot (Figure 3c) highlighted several genes with both high fold changes and statistical significance, including SYCP2, STAG3, TEX15, ZNF541, and ZPBP2 as significantly upregulated in HPV positive tumors. Boxplots of the top 10 DEGs (Figure 3d) confirmed consistent expression differences between HPV positive and negative samples.Wood et al, also showed a strong seperation of HPV+ and HPV- samples based on DEGs. They noted an immune signature enrichment in HPV+ samples, which is consistent with our findings. Despite this, there are some differences in the specific DEGs identified, likely due to differences in analysis pipelines and filtering criteria. Wood, et al. key finding was a distinct B cell signature in HPV+ tumors, which included CD200, ADAM28, BCL2, VCAM1, and STAG3. In their pipeline, they did not use SVA for batch correction, instead they added CD19,CD4 and CD8 as covariates. This could explain some of the differences in DEGs identified. Nevertheless, some of the top GO terms from our analysis align with Wood et al’s findings, particularly those related to immune processes.

Woods, et al. revealed distinct B cell signature in HPV+ tumors, which included CD200, ADAM28, BCL2, VCAM1, and STAG3, when TILs were corrected for. This analysis did not recreate the original paper, but provided further synergistic evidence that HPV+ tumors have a distinct immune microenvironment compared to HPV- tumors, when an additional layer of QC and filtering was applied. Together, these results highlight the connection between HPV status, Immune interaction, and better prognosis in HNSCC. Further research is required to elucidate the mechanistic links and therapeutic implications of these findings.

References

Carlson, Marc. 2019. AnnotationDbi: Annotation Database Interface. https://bioconductor.org/packages/AnnotationDbi.
Ferlay, J., M. Colombet, I. Soerjomataram, C. Mathers, D. M. Parkin, M. Piñeros, A. Znaor, and F. Bray. 2019. “Estimating the Global Cancer Incidence and Mortality in 2018: GLOBOCAN Sources and Methods.” International Journal of Cancer 144 (8): 1941–53. https://doi.org/10.1002/ijc.31937.
“Head and Neck Squamous Cell Carcinoma | Nature Reviews Disease Primers.” n.d. https://www-nature-com.proxy.library.vcu.edu/articles/s41572-020-00224-3.
Leek, J. T., and J. D. Storey. 2007. “Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis.” PLoS Genetics 3 (9): 1724–35. https://doi.org/10.1371/journal.pgen.0030161.
Luo, W., and C. Brouwer. 2013. “Pathview: An r/Bioconductor Package for Pathway-Based Data Integration and Visualization.” Bioinformatics 29 (14): 1830–31. https://doi.org/10.1093/bioinformatics/btt285.
Michaud, D. S., S. M. Langevin, M. Eliot, H. H. Nelson, M. Pawlita, M. D. McClean, and K. T. Kelsey. 2014. “High-Risk HPV Types and Head and Neck Cancer.” International Journal of Cancer 135 (7): 1653–61. https://doi.org/10.1002/ijc.28811.
Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.
Wood, O., J. Woo, G. Seumois, N. Savelyeva, K. J. McCann, D. Singh, T. Jones, et al. 2016. “Gene Expression Analysis of TIL Rich HPV-Driven Head and Neck Tumors Reveals a Distinct b-Cell Signature When Compared to HPV Independent Tumors.” Oncotarget 7 (35): 56781–97. https://doi.org/10.18632/oncotarget.10788.
Yu, G., L. G. Wang, Y. Han, and Q. Y. He. 2012. “clusterProfiler: An r Package for Comparing Biological Themes Among Gene Clusters.” OMICS: A Journal of Integrative Biology 16 (5): 284–87. https://doi.org/10.1089/omi.2011.0118.

Appendix

```